| | Class | Duplicate Count | Total Images | Proportion |
|---|---|---|---|---|
| 0 | Agaricus | 2 | 353 | 0.005666 |
| 1 | Amanita | 2 | 750 | 0.002667 |
| 2 | Boletus | 2 | 1073 | 0.001864 |
| 3 | Cortinarius | 2 | 836 | 0.002392 |
| 4 | Entoloma | 0 | 364 | 0.000000 |
| 5 | Hygrocybe | 1 | 316 | 0.003165 |
| 6 | Lactarius | 63 | 1563 | 0.040307 |
| 7 | Russula | 4 | 1147 | 0.003487 |
| 8 | Suillus | 0 | 311 | 0.000000 |
| 9 | Total | 76 | 6713 | 0.011321 |
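The source does not show how the duplicates in the table were detected. As a minimal sketch, assuming duplicates are byte-identical files, they could be grouped by content hash as follows. The function names `find_duplicate_files` and `redundant_copies` are hypothetical, and near-duplicates (resized or re-encoded copies) would not be caught by this approach.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_duplicate_files(root: Path) -> dict[str, list[Path]]:
    """Group files under `root` by the SHA-256 digest of their bytes.

    Only groups that contain more than one file are returned.
    """
    groups: dict[str, list[Path]] = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            groups[digest].append(path)
    return {d: paths for d, paths in groups.items() if len(paths) > 1}


def redundant_copies(groups: dict[str, list[Path]]) -> list[Path]:
    """Keep the first file of each group and list the rest for removal."""
    return [p for paths in groups.values() for p in paths[1:]]
```

Hashing the raw bytes is cheap and needs no image decoding, which is why it is used here as a first pass.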
There are some corrupt and unreadable images in the dataset that also need to be removed:
Removing corrupt image: dataset_temp\Mushrooms\Russula\092_43B354vYxm8.jpg due to image file is truncated (92 bytes not processed)
[WindowsPath('dataset_temp/Mushrooms/Russula/092_43B354vYxm8.jpg')]
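The log line above has the form of the error Pillow raises when it decodes a truncated file, which suggests the notebook finds corrupt files by fully loading each image. The sketch below is not that method. It is a cheaper, stdlib-only pre-check that only tests whether a JPEG file begins and ends with the expected markers. The names `looks_truncated` and `find_suspect_jpegs` are hypothetical, and a file that passes this check can still fail to decode.

```python
from pathlib import Path

JPEG_START = b"\xff\xd8"  # start-of-image (SOI) marker
JPEG_END = b"\xff\xd9"    # end-of-image (EOI) marker


def looks_truncated(path: Path) -> bool:
    """Return True if the file lacks the JPEG start or end marker."""
    data = path.read_bytes()
    return not (data.startswith(JPEG_START) and data.endswith(JPEG_END))


def find_suspect_jpegs(root: Path) -> list[Path]:
    """List .jpg files under `root` that fail the marker check."""
    return [p for p in sorted(root.rglob("*.jpg")) if looks_truncated(p)]
```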
Image Analysis¶
Running on 28 workers
Processing images: 100%|██████████| 6637/6637 [00:26<00:00, 248.23it/s]
Total processing time: 26.97 seconds
Image color summary¶
There are no grayscale images in the dataset: all 6637 images have 3 color channels.
| | Color Type | Count |
|---|---|---|
| 0 | Color | 6637 |
Dimensions¶
Image sizes and aspect ratios vary considerably. This is a significant concern because models like ResNet require fixed input sizes (224x224 in this case).
We are using FastAI's 'ImageDataLoaders', which handles the resizing and adds padding to samples that are not square.
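To make the resize-and-pad step concrete, here is a minimal sketch of the arithmetic only. It assumes the image is scaled so that its longer side matches the target and the shorter side is then padded. This is not FastAI's implementation, the function name `fit_and_pad` is hypothetical, and the actual behavior depends on the resize method configured in the data loaders.

```python
def fit_and_pad(width: int, height: int, target: int = 224) -> dict[str, int]:
    """Scale (width, height) to fit inside a target x target square.

    Returns the resized dimensions and the total padding needed per axis.
    """
    scale = target / max(width, height)
    new_w = max(1, round(width * scale))
    new_h = max(1, round(height * scale))
    return {
        "width": new_w,
        "height": new_h,
        "pad_x": target - new_w,  # total horizontal padding in pixels
        "pad_y": target - new_h,  # total vertical padding in pixels
    }


# The largest image in the dataset is 1280x1024.
largest = fit_and_pad(1280, 1024)
```

For the 1280x1024 image this gives a 224x179 resized image with 45 pixels of vertical padding, so wide images lose a noticeable share of the input area to padding.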
| Statistic | Width | Height | Aspect Ratio |
|---|---|---|---|
| count | 363.000000 | 505.000000 | 1156.000000 |
| mean | 709.586777 | 623.914851 | 1.325247 |
| std | 207.637554 | 183.902749 | 0.237855 |
| min | 259.000000 | 152.000000 | 0.561250 |
| 25% | 572.500000 | 487.000000 | 1.226641 |
| 50% | 702.000000 | 613.000000 | 1.336744 |
| 75% | 797.500000 | 749.000000 | 1.462122 |
| max | 1280.000000 | 1024.000000 | 2.857143 |
Color Variance and Entropy¶
We'll further analyze the images using these metrics:
Average variance of the color channels in each image:
- Variance = 0: All pixels in the image have the same color.
- High Variance: Indicates images with diverse color pixels.
Number of unique colors in each image
Entropy (shannon_entropy).
- Scale: 0 to log2(N), where N is the number of possible pixel values (i.e. 0 to 8 bits for 256 grayscale levels).
- Min Entropy = 0: Perfectly uniform image (single color).
- High Entropy: Indicates images with a wide variety of colors and patterns.
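The three metrics can be sketched with the standard library alone. This is an illustration of the definitions, not the notebook's code: the name `shannon_entropy` in the text suggests a library function, but which one is an assumption, and the helper names below are hypothetical. Pixels are assumed to be `(R, G, B)` tuples, and entropy is assumed to be computed on a flat list of intensity values.

```python
import math
from collections import Counter
from statistics import pvariance

Pixel = tuple[int, int, int]


def mean_channel_variance(pixels: list[Pixel]) -> float:
    """Average of the per-channel population variances (R, G, B)."""
    channels = zip(*pixels)
    return sum(pvariance(channel) for channel in channels) / 3


def unique_colors(pixels: list[Pixel]) -> int:
    """Number of distinct (R, G, B) triples."""
    return len(set(pixels))


def shannon_entropy(values: list[int]) -> float:
    """Entropy in bits of the empirical distribution of `values`."""
    counts = Counter(values)
    total = len(values)
    return sum((c / total) * math.log2(total / c) for c in counts.values())
```

A single-color image gives zero variance, one unique color, and zero entropy. A list in which all 256 intensity values occur equally often reaches the maximum of log2(256) = 8 bits.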
| Statistic | Variance | Unique Color | Entropy |
|---|---|---|---|
| count | 6637.000000 | 6637.000000 | 6637.000000 |
| mean | 3339.696198 | 9117.627844 | 7.575777 |
| std | 1040.674133 | 858.859522 | 0.289316 |
| min | 0.000000 | 1.000000 | 0.000000 |
| 25% | 2629.536395 | 8884.000000 | 7.488475 |
| 50% | 3218.788643 | 9367.000000 | 7.625352 |
| 75% | 3928.063424 | 9661.000000 | 7.732022 |
| max | 11091.372221 | 9968.000000 | 7.979296 |
This information will be used to filter out samples that might not be useful for image analysis.
[Figure: Entropy Distribution by Class]
Samples with Very Low Entropy¶
The images below have very low entropy (i.e. they fall in the bottom 0.5% of the dataset).
While some of these are actual mushrooms photographed against a single-color background (probably in a studio), the image with 0.0 entropy is not valid. Additionally, one of the images is not a mushroom at all but a random pattern. Unfortunately, our color-variance-based approach is not particularly useful for identifying such images unless the pattern is very simple.
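Selecting the bottom 0.5% by entropy can be sketched as a simple ranking. The function name `bottom_fraction` and the rounding rule (round down, but return at least one item) are assumptions; the notebook may compute a percentile threshold instead.

```python
import math


def bottom_fraction(scores: dict[str, float], fraction: float = 0.005) -> list[str]:
    """Return the keys with the lowest scores, covering `fraction` of all items.

    At least one key is returned whenever `scores` is non-empty.
    """
    if not scores:
        return []
    k = max(1, math.floor(len(scores) * fraction))
    ranked = sorted(scores, key=scores.get)  # ascending by score
    return ranked[:k]
```

With the 6637 images analyzed here, this rule would select 33 candidates for manual review. Reviewing them by eye matters because, as noted above, low entropy alone does not separate studio photographs from invalid images.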
High Entropy Images¶
These images tend to show very colorful mushrooms against colorful, varied forest backgrounds.
Color Channel Distribution by Class¶
These plots show the distribution of pixel intensities (0 - 255) for each color channel, by class. The Y axis shows the normalized frequency (density), scaled relative to all color channels (based on the highest individual value for any channel).
The charts are made by generating a histogram for each image and normalizing it (the normalization preserves the shape of the histogram, i.e. the relative distribution of pixel intensities). All histograms in a class are then averaged.
As we would expect, green and red tend to be dominant in most images, which reflects the colors of most mushroom types and of the forest floor they tend to be photographed against.
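The histogram pipeline described above can be sketched in three steps: a normalized histogram per image, an element-wise average per class, and a final scaling by the highest bin across all channels. The function names are hypothetical, and normalizing each histogram by its pixel count is an assumption about the exact normalization used.

```python
def channel_histogram(values: list[int], bins: int = 256) -> list[float]:
    """Histogram of intensities in 0..bins-1, normalized to sum to 1."""
    counts = [0] * bins
    for v in values:
        counts[v] += 1
    total = len(values)
    return [c / total for c in counts]


def average_histograms(histograms: list[list[float]]) -> list[float]:
    """Element-wise mean of equally sized histograms."""
    n = len(histograms)
    return [sum(column) / n for column in zip(*histograms)]


def scale_to_global_peak(channels: dict[str, list[float]]) -> dict[str, list[float]]:
    """Divide every channel by the single highest bin across all channels."""
    peak = max(max(hist) for hist in channels.values())
    return {name: [v / peak for v in hist] for name, hist in channels.items()}
```

Normalizing per image before averaging means a large image does not outweigh a small one, which matters here given how much the image dimensions vary.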
Data Sampling¶
Our next step is to split the dataset into test and training samples, keeping the test set at roughly 15% of the data. The sample ratios used in our model training:
- Test: 15% of full dataset
- Validation: 20% of remaining samples (i.e. 17% of full dataset)
- Remaining images used for training: 68% of full dataset (80% of the 85% left after the test split)
We're using stratification to maintain a representative class distribution in each split. The invalid images found during our analysis are also excluded.
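The two-stage stratified split can be sketched as follows. This is a stdlib-only illustration with a hypothetical helper `stratified_split` and made-up file names; the notebook's own splitting code is not shown in the source, and its rounding of per-class counts may differ.

```python
import random
from collections import defaultdict


def stratified_split(labels: dict[str, str], fraction: float, seed: int = 42):
    """Split file names into (held_out, remaining), holding out `fraction` per class."""
    by_class: dict[str, list[str]] = defaultdict(list)
    for name, label in sorted(labels.items()):
        by_class[label].append(name)

    rng = random.Random(seed)
    held_out, remaining = [], []
    for names in by_class.values():
        rng.shuffle(names)
        k = round(len(names) * fraction)
        held_out += names[:k]
        remaining += names[k:]
    return held_out, remaining


# Toy data: two classes with 100 files each (hypothetical names).
labels = {f"{cls}_{i}.jpg": cls for cls in ("Boletus", "Suillus") for i in range(100)}

# Stage 1: hold out 15% for testing. Stage 2: 20% of the rest for validation.
test, rest = stratified_split(labels, 0.15)
val, train = stratified_split({name: labels[name] for name in rest}, 0.20)

# No file may appear in more than one split.
assert not (set(test) & set(val) or set(test) & set(train) or set(val) & set(train))
```

With 100 files per class this yields 15 test, 17 validation, and 68 training files per class, which matches the 15% / 17% / 68% proportions of the full dataset.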
Verification Passed: No overlapping files between train and test sets.
| Class | Training Samples | Training Proportion (%) | Testing Samples | Testing Proportion (%) | Total Samples |
|---|---|---|---|---|---|
| Boletus | 909 | 84.95% | 161 | 15.05% | 1070 |
| Suillus | 264 | 84.89% | 47 | 15.11% | 311 |
| Cortinarius | 709 | 85.01% | 125 | 14.99% | 834 |
| Russula | 972 | 85.04% | 171 | 14.96% | 1143 |
| Agaricus | 298 | 84.90% | 53 | 15.10% | 351 |
| Amanita | 636 | 85.03% | 112 | 14.97% | 748 |
| Entoloma | 309 | 84.89% | 55 | 15.11% | 364 |
| Hygrocybe | 268 | 85.08% | 47 | 14.92% | 315 |
| Lactarius | 1274 | 84.99% | 225 | 15.01% | 1499 |